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Abstract 

In this paper we discuss numerical methods and algorithms for the solution of NLTE 
stellar atmosphere problems involving expanding atmospheres, e.g., found in novae, 
supernovae and stellar winds. We show how a scheme of nested iterations can be 
used to reduce the high dimension of the problem to a number of problems with 
smaller dimensions. As examples of these sub-problems, we discuss the numerical 
solution of the radiative transfer equation for relativistically expanding media with 
spherical symmetry, the solution of the multi-level non-LTE statistical equilibrium 
problem for extremely large model atoms, and our temperature correction proce¬ 
dure. Although modern iteration schemes are very efficient, parallel algorithms are 
essential in making large scale calculations feasible, therefore we discuss some par¬ 
allelization schemes that we have developed. 


1 Introduction 


Astronomy is sometimes described as a “passive science” since it depends 
on observations of distant objects, rather than on laboratory experiments. 
Due to both laboratory measurements of important astrophysical atomic and 
nuclear data, and advances in computational power which allow us to perform 
“numerical experiments” that situation has changed in the last 50 years, and 
astronomy has matured into the modern subject of astrophysics. Still, our 
ability to understand the nature of astronomical objects is hampered by the 
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fact that astronomical observations detect radiation that has been emitted 
primarily from the surface of objects. Thus in order to determine the structure 
of stellar objects one must solve the radiation transport equation and compare 
“synthetic spectra” with observations. The numerical solution of the radiation 
transport problems is an important prerequisite for the calculation of model 
stellar atmospheres. The simulated spectrum is then compared to observed 
spectra of objects such as stars, where the radiation is mostly emitted from 
the outer layers. In the case of very low mass stars and brown dwarfs, the 
atmosphere is also crucial in determining the interior structure of these objects 
since it serves as a boundary condition for the equations of stellar structure 
in a nearly fully convective “star”. 

Additionally, in objects such as supernovae, where the the explosion causes 
the “atmosphere” (here used to paraphrase “the region where the spectrum 
forms”) to expand rapidly, a time series of spectra reveal the entire structure 
of the object as the ejected material expands and thins and the atmosphere 
moves inward in the material. In the case of expanding objects such as hot 
stars (many with strong stellar winds), novae, and supernovae; the radiation 
transport equation must be solved simultaneously with the hydrodynamical 
equations, an even more difficult computational problem than static stars. 
We focus here on the computation of model atmospheres and the numerical 
solution of the radiation transport equation in expanding media with known 
velocity fields. This is a frequently encountered situation, e.g., when the hy¬ 
drodynamic behavior is known a priori, or can be calculated separately from 
the radiation transport by using a nested iteration scheme. The feedback be¬ 
tween detailed synthetic spectrum calculations and hydrodynamic simulations 
is often the primary tool for testing a specihc hydrodynamical model. 

Our group has developed the very general non-LTE (NLTE) stellar atmosphere 
computer code PHOENIX [1-9] which can handle very large model atoms as well 
as line blanketing by hundreds of millions of atomic and molecular lines. This 
code is designed to be both portable and very flexible: it is used to compute 
model atmospheres and synthetic spectra for, e.g., novae, supernovae, M and 
brown dwarfs, O to M giants, white dwarfs and accretion disks in Active 
Galactic Nuclei (AGN). The radiative transfer in PHOENIX is solved in spherical 
geometry and includes the effects of special relativity (including advection and 
aberration) in the modeling. The PHOENIX code allows us to include a large 
number of NLTE and LTE background spectral lines and solves the radiative 
transfer equation for each of them without using simple approximations like 
the Sobolev approximation. Therefore, the prohles of spectral lines must be 
resolved in the co-moving (Lagrangian) frame. This requires many wavelength 
points (we typically use 150,000 to 300,000 points). Since the GPU time scales 
linearly with the number of wavelength points, the GPU time requirements 
of such calculations are large. In addition, (NLTE) radiative rates for both 
line and continuum transitions must be calculated and stored at every spatial 
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grid point for each transition, which requires large amounts of storage and 
can cause signihcant performance degradation if the corresponding routines 
are not optimally coded. 

In strict LTE the radiation and matter are assumed to be in equilibrium 
with each other everywhere throughout the atmosphere. In LTE the source 
function is assumed to be given by the Planck function (see below). In NLTE, 
the radiation is no longer assumed to be in equilibrium with the matter and 
hence the full coupling between matter and radiation must be calculated in 
order to calculate the source function. 

We concentrate here on the calculation of model atmospheres for expanding 
media and, in addition, describe some of the important parts of the numerous 
numerical algorithms used in PHOENIX; the numerical solution of the radia¬ 
tion transport equation, the non-LTE rate equations, and the parallelization 
of the code. An important problem in these calculations is to hud a consis¬ 
tent solution of the very diverse equations that describe the various physical 
processes. We have developed a scheme of nested iterations that enables us 
to separate many of the variables (e.g., separating the temperature correc¬ 
tion procedure from the calculation of the NLTE occupation numbers). This 
allows us to compute far more detailed stellar atmosphere models than was 
previously possible. We will give an outline of these methods in this paper. 

In order to take advantage of the enormous computing power and vast ag¬ 
gregate memory sizes of modern parallel supercomputers, both potentially 
allowing much faster model construction as well as more sophisticated mod¬ 
els, we have developed a parallel version of PHOENIX. Since the code uses a 
modular design, we have implemented different parallelization strategies for 
different modules (e.g., radiative transfer, NLTE rates, atomic and molecular 
line opacities) in order to maximize the total parallel speed-up of the code. In 
addition, our implementation allows us to change the distribution of compu¬ 
tational work onto different nodes both via input hies and dynamically during 
a model run, which gives a high degree of hexibility to optimize performance 
for both a number of different parallel supercomputers (we are currently using 
IBM SP2s, SGI Origin 2000s, HP/Convex SPP-2000s, and Cray T3Es) and 
for different model parameters. Since PHOENIX has both large CPU and mem¬ 
ory requirements we have developed the parallel version of the code using a 
MIMD approach. We use the MPI message passing library [10] for portability 
and simultaneously use both task and data parallelism in order to optimize 
the parallel speed-up [8,9]. 


2 The Problem 
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2.1 Overview 


Our goal (for the purposes of this paper) is to construct self-consistent models 
of expanding stellar atmospheres. The atmosphere itself is parameterized by 
a number of parameters, e.g., the total energy emitted by the object (lumi¬ 
nosity L), the mass M of the star, the abundances of the elements (in some 
cases as function of the location in the atmosphere). This means that we have 
to hnd a set of physical variables such as temperatures, densities, population 
number of each atomic energy level and the radiation held, at each location in 
the atmosphere so that all constraint equations are simultaneously fulhlled. In 
Fig. 1 we show this requirement in a simplihed graphical form where the ar¬ 
rows indicate direct (usually non-linear) coupling between the different blocks. 
The number of variables that need to be addressed is, formally, very large. A 
typical case of a spherically symmetric shell model with 50 radial points re¬ 
quires a set of 50 temperatures and gas pressures (or matter densities). In 
addition to this we include a set of about 6000 individual energy levels for 
atoms and ions directly, the population of each must be known at every radial 
point, adding a total of 300,000 variables. In order to calculate the population 
numbers, we need a description of the radiation held at each radial point and 
on a set of wavelength points (the rates that govern the transitions between 
atomic energy levels are integrals of the mean intensity of the radiation held 
over wavelength). We typically need 100,000 to 300,000 wavelength point to 
describe the complete radiation held, which adds, in the worst case, 15 million 
variables to the system. 

Fortnnately, most of these formal variables are tightly coupled to a much 
smaller set of variables which we might, therefore, consider the “fnndamental” 
variables of the model atmosphere problem. In our approach, these fundamen¬ 
tal variables are the temperatures T, the gas pressures Pgas, and the population 
numbers Uj at each radial point r*. The radiation held is considered a “derived” 
quantity and the problem is thus reduced to hnd a set of physical variables 
{T, Pgas,ni} at each radial point i so that the system outlined in Fig. 2 is 
self-consistent. With this approach we have reduced the number of variables 
from several million to a few 100 thousand, which is still a dannting nnmber. 

Although it is possible to analytically bring the system into a form so that 
it could be solved by a Newton-Raphson approach [11], this idea is computa¬ 
tionally prohibitive because of its enormous memory and time reqnirements 
(however, for smaller systems this approach has been used successfully). Fur¬ 
thermore, this approach is complex to implement and it is relatively hard 
to add more “physics” to the model atmosphere. We have thus developed a 
scheme of nested iterative solutions that considers the direct (or strong) con- 
plings between important variables directly and iteratively acconnts for the 
indirect coupling between sets of variables. With this approach the problem 
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of constructing the model atmosphere can be separated into solving a large 
number of smaller problems with only a few 100 variables. The global require¬ 
ment of a self-consistent solution is then reach by iteratively coupling these 
sets of variables to each other until a prescribed accuracy has been reached. 
This method works because the level of coupling between the variables is very 
different. For example, the temperature structure of the atmosphere depends 
mostly on the global constraint of energy conservation (represented by wave¬ 
length integrals over the whole spectrum) and on the ratios of several averaged 
opacities, but it does not depend strongly on the hne details of the radiation 
held or the individual population of the vast majority of the atomic levels. 
Therefore, correction to the temperature structure can be calculated approx¬ 
imately. The current errors of, e.g., the energy conservation equations, must 
be calculated exactly in order to this scheme to function, however, this is rela¬ 
tively simple. The general idea of calculating errors exactly but corrections to 
the variables approximately will work if the approximations are good enough 
for the scheme to converge at all. This method will require more iterations to 
reach convergence but this is more than offset by faster individual iterations 
and (very often) by better robustness. The latter is very important if many 
model atmospheres have to be constructed or if no good initial guesses for the 
the variables are known. 

In the following sections we will concentrate on a few key parts of the ex¬ 
panding atmosphere problem; the numerical solution of the radiative transfer 
equation for a single wavelength point (this will deliver the radiation field for a 
given set of variables at every wavelength), the solution of the NLTE statistical 
equilibrium equations (coupling the radiation field to the level populations), 
and an outline of the temperature correction procedure. The latter is impor¬ 
tant because it allows us to solve the NLTE statistical equilibrium equations 
separately for individual elements (and even ionization stages), which dramat¬ 
ically reduces the dimension of the sub-problems that have to be solved within 
the global nested iteration scheme. In this paper we will not discuss problems 
related to the hydrodynamics of the expanding medium or the details of the 
equation of state calculations, both of which are important topics. 


2.2 Radiative transfer in expanding media 


The equation of radiative transfer (RTE) in spherical symmetry for moving 
media has been solved with a number of different methods, e.g. Monte Carlo 
calculations [12-14], Sobolev methods [15], the tangent ray method [16], and 
the DOME method [17]. Only the tangent ray and the DOME method have 
been used to solve the RTE for very fast expanding shells (e.g. supernovae 
or novae) including the necessary special relativistic terms. Both methods 
need relatively large amounts of CPU time to compute the radiation held. 
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mainly because of the need for matrix inversions (tangent ray method) or 
matrix diagonalization (DOME), which make both of them impractical for 
use within radiation-hydrodynamic studies of nova or supernova explosions. 
It has been shown [18] that the special relativistic terms in the RTE can be 
very important, even in the optically thick regions of expanding shells, and 
lead to results different than from the simpler approach which simply neglects 
the relativistic terms. 


Recently, iterative methods for the solution of the RTE have been developed, 
based on the philosophy of operator perturbation [19,20]. Following these 
ideas, different approximate A-operators for this “accelerated A-iteration” 
(ALI) method have been used successfully [21-23] and have been applied to the 
construction of non-LTE, radiative equilibrium models of stellar atmospheres 
|23], 


We describe the use of the short-characteristic method [21,24] to obtain the for¬ 
mal solution of the special relativistic, spherically symmetric radiative transfer 
equation (SSRTE) along its characteristic rays and then use a band-diagonal 
approximation to the discretized A-operator [1,24,25] as our choice of the ap¬ 
proximate A-operator. This method can be implemented very efficiently to 
obtain an accurate solution of the SSRTE for continuum and line transfer 
problems using only modest amounts of computer resources. 


The co-moving frame radiative transfer equation for spherically symmetric 
flows can be written as [26]: 
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P = v/cis the velocity in units of the speed of light, c; and 7 = (1 — 

is the usual Lorentz factor. Equation 1 is a integro-differential equation, since 

the emissivity contains Jy, the zeroth angular moment of 1^. 

rfy = KySy + ayjy + ^ / 0^J,, c^z/, 

lines 


with 


1 

J, = 1/2 J djj I,, ( 2 ) 

-1 

where S^, is the source function, k.^, is the absorption coefficient, is the 
scattering coefficient for continuum processes, ai are the line scattering coef- 
hcients, and <pi is the line prohle function. The independent variables are the 
radius r of the shell, the cosine fi of the angle between the radial direction and 
the propagation vector of the light (with /i = — 1,1 for radially inward and 
outward moving light, respectively), and the frequency z/ = c/A for a wave¬ 
length A of the light. With the assumption of time-independence, ^ = 0, and 
a monotonic velocity held Eq. 1 becomes a boundary-value problem in the 
spatial coordinate and an initial value problem in the frequency or wavelength 
coordinate. 

Switching from frequency to wavelength (Eq. 1 is presented in Ref. [1] in 
wavelength), the mean intensity Jx is obtained from the source function S\ 
by a formal solution of the RTE which is symbolically written using the A- 
operator Aa as 


Jx = AxSx. 


( 3 ) 


In the case of the transition of a two-level atom, we have 


J = AS, (4) 

where J = J (l){\)Jxd\, A = f </)(A)AxdA with the normalized line prohle 
0(A). The line source function, for the simple case of a two-level atom without 
continuum and background absorption or scattering, is given by S' = (1 — 
e) J -|- eB, where e denotes the thermal coupling parameter and B is Planck’s 
function. 

The A-iteration method, i.e. to solve Eq. 4 by a hxed-point iteration scheme 
of the form 
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fails in the case of large optical depths and small e. This result is caused by 
the fact that the largest eigenvalue of the amplihcation matrix (in the case of 
Doppler-prohles) is approximately [16] Amax ~ (1 ~ ^)(1 ~ T~^), where T is 
the optical thickness of the medium. For small e and large T, this is very close 
to unity and, therefore, the convergence rate of the A-iteration is very poor. 
A physical description of this effect can be found in Mihalas [27]. 


2.2.1 The operator splitting method 

The idea of the ALT or operator splitting method is to reduce the eigenvalues 
of the amplihcation matrix in the iteration scheme [19] by introducing an 
approximate A-operator (ALO) A* and to split A according to 

A = A* + (A - A*) (6) 

and rewrite Eq. 4 as 

J„ew = A*^new + (A “ A*)S'old. (7) 

This relation can be written as [22] 

[1 - A*(l - e)] Jnew = 4 - A*(l - e)4id, (8) 

where Jfs = AAom- Equation 8 is solved to get the new values of J which is 
then used to compute the new source function for the next iteration cycle. 

Mathematically, the ATI method belongs to the same family of iterative meth¬ 
ods as the Jacobi or the Gauss-Seidel methods [28]. These methods have the 
general form 


Mx^+^ = Nx^ + b (9) 

for the iterative solution of a linear system Ax = b where the system matrix 
A is split according to A = M — A^. In the case of the ALI method we have 
M = 1 — A*(l — e) and, accordingly, N = (A — A*)(l — e) for the system 
matrix A = 1 — A(1 — e). The convergence of the iterations depends on the 
spectral radius, p{G), of the iteration matrix G = M~^N. For convergence 
the condition p{G) < 1 must be fulhlled, this puts a restriction on the choice 
of A*. In general, the iterations will converge faster for a smaller spectral 
radius. To achieve a signihcant improvement compared to the A-iteration, the 
operator A* is constructed so that the eigenvalues of the iteration matrix G 
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are much smaller than unity, resulting in swift convergence. Using parts of 
the exact A matrix (e.g., its diagonal or a tri-diagonal form) will optimally 
reduce the eigenvalues of the G. The calculation and the structure of A* should 
be simple in order to make the construction of the linear system in Eq. 8 
fast. For example, the choice A* = A is best in view of the convergence rate 
(it is eqnivalent to a direct solution by matrix inversion) but the explicit 
construction of A is more time consnming than the constrnction of a simpler 
A*. The solntion of the system Eq. 8 in terms of linear algebra, using modern 
linear algebra packages such as, e.g., LAPACK, is so fast that its CPU time can 
be neglected for the small number of variables encountered in ID problems 
(typically the nnmber of discrete shells is abont 50). However, for 2D or 3D 
problems the size of A gets very large dne to the mnch larger nnmber of grid 
points as compared to the ID case. Matrix inversions, which are necessary to 
solve Eq. 8 directly, therefore become extremely time consnming. This makes 
the direct solntion of Eq. 8 more CPU intensive even for A*’s of moderate 
bandwidth, except for the trivial case of a diagonal A*. Different methods like 
modihed conjngate gradient methods [29] may be effective for these 2D or 3D 
problems. 

The CPU time reqnired for the solntion of the RTE using the ALI method 
depends on several factors: (a) the time reqnired for a formal solntion and 
the compntation of Jfs, (b) the time needed to constrnct A*, (c) the time 
reqnired for the solntion of Eq. 8, and (d) the nnmber of iterations reqnired 
for convergence to the prescribed accnracy. Points (a), (b) and (c) depend 
mostly on the nnmber of discrete shells, and can be assnmed to be hxed 
for any given conhgnration. However, the nnmber of iterations reqnired to 
convergence depends strongly on the bandwidth of A*. This indicates, that 
there is an optimum bandwidth of the A*-operator which will resnlt in the 
shortest possible CPU time needed for the solntion of the RTE, which we will 
discuss below. 


2.2.2 Computation of A* 

The formal solution of the SSRTE is performed along the characteristic rays 
on a mesh {rj}, i = 1,... ,Ns of discrete shells using the short-characteristic 
(SC) method of Olson and Knnasz [24] with piece-wise parabolic or linear 
interpolation. The characteristic rays are curved in the case of the SSRTE and 
have to be calculated before the solution of the radiative transfer eqnation 
proceeds (see [1] for details). Improvements in this method [1,25] inclnde an 
improved angle integration using generalized Simpson-qnadratnre and a gen¬ 
eralization of the approximate A-operator to an arbitrary nnmber of bands 
below and above the main diagonal (np to the fnll A-operator). 

We describe here the general procednre of calculating the A* with arbitrary 
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bandwidth, up to the full A-operator, for the SC method in spherical symmetry 
[25]. Although we consider the SSRTE as given in the previous section, the 
same procedure applies for radiative transfer problems in static media or in 
(static or moving) media with plane-parallel symmetry. The specialization of 
the formulae given in this section is straightforward. 

The formal solution along a characteristic of the SSRTE (hereafter, a “ray”) is 
done using a polynomial interpolation of the source function, S, along the ray. 
For reasons of numerical stability, we use linear or quadratic interpolation of S 
along each ray, although this is not required by the method. This leads to the 
following expressions for the specihc intensity /(pj) along a ray (cf. Ref. [24] 
for a derivation of the formulae): 


' i 

^ exp(rf_i - rf ) + J S{t) exp(rf_i - r) dr, (10) 

jf^Jtiexp(-Arf_i)+AJf, 

where the superscript k labels the ray; rf denotes the optical depth along the 
ray k with = 0 and r|Li < rf while is calculated, e.g., using piecewise 
linear interpolation of x along the ray, viz. 


Arf_i = {xi-i + Ai)kti - sf 1/2. (11) 

and 


A/‘=yS._,+/3‘S, + 7*S.+i, (12) 

where i is the “running” index along the ray and ~ .sf I is the geometrical 
path length between points i and i — 1. The expressions for the coefficients 
and yf are given in Ref. [24] (see also Ref. [1]). 

We describe the construction of A* for arbitrary bandwidth using the example 
of a characteristic that is tangential to an arbitrary shell; Ray k is the ray 
that is tangent to shell k + 1 The intersection points (including the point of 
tangency) are labeled from left to right, the direction in which the formal 
solution proceeds. Ray k has 2A; -|- 1 points of intersection with discrete shells 
1.. .k + 1. To compute row j of the discrete A-operator (or A-matrix), Aj^, 
we sequentially label the intersection points of the ray k with the shell i, and 
define auxiliary quantities Ah and Ah as follows: 
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\k 

\j 

= 0 

for i < j — 1 

k 

= ^U 

for i = j — 1 

\k 

= Aj_i,,.exp(-Ar^,)+/3^ 

for i = j 

k 

‘i+ij 

= Ajj exp(-Arj') + 

for i = j -|- 1 


= Ati^^.exp(-Arf_i) 

for J -I- 1 < i < /c -I- 1 


( 13 ) 


For the calculation of Afj, we obtain: 
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for i 

= k 
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\k 

^ip 

= 
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for k 

+ j 

+ 5 

< 

i <2k + 1 


Using the Ah and Ah, we can now write the A-Matrix as 


k 


Vw in / 


(15) 


where wfj are the angular quadrature weights, {/} is the set {i < k + 1} and 
{/'} is the set {i > k + 1}. This expression gives the full A-matrix, it can easily 
be specialized to compute only certain bands of the A-matrix. In that case, 
not all of the Afj and Afj have to be computed, reducing the CPU time from 
that required for the computation of the full A-matrix. 


2.2.3 Numerical Considerations 

The calculation of A* using the algorithm outlined can be vectorized and 
parallelized with respect to the ray index k and the row index j for any given 
bandwidth of A*. In addition, quantities like exp(—Ar^^^), af, and yf 
can be pre-calculated and stored, a process which is fully vectorizable and 
parallelizable. 

For each point on a ray, the computation of the specihc intensity uses about 7 
floating point operations (flops), whereas the computation of the j and j 
takes only 1 flop per intersection point. In addition, about 3 flops are needed 
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for the integration over the angle coordinate jj, in order to compute the mean 
intensities J and the A*-operator. We have to calculate the formal solution for 
7 Vt(A^t + 1) + At + 2NsNc points, where Ag is the number of discrete shells, 
Ac is the number of core intersecting characteristics and At = Ag — 1 is the 
number of tangent rays. Therefore, the number of flops required for the compu¬ 
tation of the specihc intensities at all points is ~ 10[(Ag-|-l)(Ag —l)-|-2AgAc]. 
To estimate the number of flops required for the calculation of a A*-operator 
with a bandwidth of Ab < Ag, we assume that each point of a ray has Ab 
nearest neighbors, thus overestimating the number of operations. In this ap¬ 
proximation, we have to compute < AbAt(At -|- 2 ) -|- 2N-qN^Nc auxiliary 
variables j or Afj. Therefore, about < 4AB[(Ag — l)(Ag-|-l)-|-2AgAc] float¬ 
ing point operations are needed to compute the A*-operator and the ratio of 
the numerical work needed for the computation of a A*-operator with a band¬ 
width of Ab and one formal solution is of the order of 2Ab/5. This expression 
actually significantly overestimates the number of operations required for the 
construction of the A* operator, in particular for larger bandwidths (the ef¬ 
fects of the boundaries become more important for larger bandwidths). For 
example, according to this estimate the computation of the full A-matrix for 
Ag = 50 takes about the same time as 20 formal solutions, however, the actual 
time used for the construction of the full A-matrix corresponds only to about 
6 formal solutions on many machines. This indicates that the number of it¬ 
erations must be rather small in order to make ALO’s with small bandwidth 
competitive in terms of speed for the solution of radiative transfer problems 
and that the initial guess for the source function will have a large influence 
on the optimum bandwidth. The best strategy is to use monitoring to pre¬ 
dict the “optimum” bandwidth that gives the shortest time for the solution 
of the SSRTE at any given wavelength point in an “adaptive bandwidth op¬ 
erator splitting” method, see Ref. [25] for details and results for a number of 
machines. 

In order to accelerate convergence the Ng method [30] or the Orthomin method 
[31] may be used (see Auer [32] for a review of different acceleration methods). 
These methods can cut down the number of iterations required to reach a 
prescribed accuracy by a factor of two or more with only a small increase in 
computational overhead. 


2.3 NLTE calculations 


In order to solve Eq. 1, the emissivity rjx must be known, but rjx depends on 
the NLTE level populations and therefore the NLTE rate equations must be 
solved simultaneously with Eq. 1. This is further complicated by the fact that 
the NLTE rate equations depend on the radiation field itself. The NLTE rate 
equations have the form [11] 
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( 16 ) 


j<i [i<*V^*/ j>i 

+ XI % f X) (-^i* + ^ij) = 0- 

i>i \'*i/ 

In Eq. 16, n* is the actual, non-LTE population density of a level i and the 
symbol n* denotes the so-called LTE population density of the level i, which 
is given by 


^ gi 2h^ne ( Ei - E^\ 

Here denotes the actual, i.e., non-LTE, population density of the ground 
state of the next higher ionization stage of the same element; gi and g^ are 
the statistical weights of the levels i and k, respectively. In Eq. 17, Ei is the 
excitation energy of the level i and E^ denotes the ionization energy from the 
ground state to the corresponding ground state of the next higher ionization 
stage. The actual, non-LTE electron density is given by rie- The system of rate 
equations is closed by the conservation equations for the nuclei and the charge 
conservation equation (cf. Ref. [11]). 



The sums in Eq. 16 extend only over the levels that are included in our model 
atoms; for example, in singly ionized iron our model atom consists of 675 en¬ 
ergy levels [3]. The weaker radiative transitions are treated as LTE background 
opacity (see Refs. [2,3]). 


The rate coefficients for radiative and collisional transitions between two levels 
i and j (including transitions from and to the continuum, see below) are given 
by Rij and Cij, respectively. In our notation, the upward (absorption) radiative 
rate coefficients Rij {i < j) are given by 


R^, = ^Ja,,{\)Jx{X)XdX, (18) 

0 

whereas the downward (emission) radiative rate coefficients Rji {i < j) are 
given by 


Here, J is the mean intensity, T the electron temperature, h and c and Planck’s 
constant and the speed of light, respectively. For the purposes of this paper. 
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we assume that cross section aij(A) of the transition i —j at the wavelength 
A is known for both line and continuum transitions and that it is the same for 
both absorption and emission processes (complete redistribution). 

Not all atomic processes £t neatly into the above scheme where the rates are in 
detailed balance. Non-thermal ionization by fast electrons, K-capture, Auger 
emission, and two-photon decay are important in various stages of the evolu¬ 
tion of novae and supernovae. They can be included in the above formulation 
with reasonable approximations, however. 


2.3.1 The Rate Operator 

As described above, a simple hxed point iteration scheme for the solution of 
the rate equations will converge much too slowly to be useful for most cases 
of practical interest. Therefore, we use an extension of the operator splitting 
idea for the solution of the rate equations. 

We rewrite the rate equations in the form of an “operator equation.” This 
equation is then used to introduce an “approximate rate operator” in analogy 
to the approximate A-operator which can then be used to iteratively solve the 
rate and statistical equations by an operator splitting method, details of the 
approach are given in [2]. 

We introduce first the “rate operator” [Rij] for upward transitions in analogy 
to the A-operator. [Rij] is defined so that 

R,, = [R,,][n]. (20) 

Here, [n] denotes the “population density operator”, which can be considered 
as the vector of the population densities of all levels at all points in the medium 
under consideration. The radiative rates are (linear) functions of the mean 
intensity J, which is given by J(A) = A(A)S'(A), where S = ? 7 (A)/x(A) is the 
source function. Using the A-operator, we can write [i?q][n] as: 


4:71 f 

[R^M = — J a,,{X)A{X)S{X) XdX. 

This can be brought into the form (see [2] for details) 


( 21 ) 


[Rij] [n 


dTT 

he 


J aij{X)^{X)E{X) XdX 
.0 


n . 


( 22 ) 


The corresponding expression for the emission rate-operator [Rji] is given by 
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I^ + «'(^)[-E(A)|[n|| exp \d\ (23) 


where we have used the dehnition 


A(A) = ^(A)/x(A), (24) 

and [-E(A)] is a /mear operator such that [i?(A)][n] gives the emissivity 77 (A). 
Using the rate operator, we can write the rate equations in the form 


E % ([Rd [n] +Cji) - n J ^ V [Rij] [n] + Cji) 

j<i 

+ X! ([-^*i]N + ^*i) 1 (2^) 

j>i ) 

+ fui) i[Rji\d] + ^ij) = 0- 

j>i \^j / 

This form shows, explicitly, the non-linearity of the rate equations with re¬ 
spect to the population densities. Note that in addition, the rate equations 
are non-linear with respect to the electron density via the collisional rates. 
Furthermore, the charge conservation constraint condition directly couples 
the electron densities and the population densities of all level of all atoms and 
ions with each other. 

In analogy to the operator splitting method discusses above, we split the rate 
operator, by writing [Rij] = [R*j] + {[Rij] - [R*j]) = [R*j] + [Ai?^] (analog for 
the downward radiative rates), where [R^ is the “approximate rate-operator”. 
We then rewrite the rate Rij as 


Rij = [Rd [rinew] + [^Rij] [’^old] (26) 

and analogously for the downward radiative rates. In Eq. 26, [uoid] denotes 
the current (old) population densities, whereas [unew] are the updated (new) 
population densities to be calculated. The [Rij] and [Rji\ are linear functions 
of the population density operator [n^] of any level k, due to the linearity of 
f] and the usage of the T-operator instead of the A-operator. 

If we insert Eq. 26 into Eq. 16, we obtain the following system for the new 
population densities: 
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^ ^ ^j,new [R 
j<i 


'jiJ ['^newj "^2,new ^ ^^ ^ 


E B [^y[^-w] + E[^yK 


J<i 


j>i 


n. 


+ E %,new B i«yi .^new] + E ^j,new ([A%][noid]+C',,) (27) 




n 


j<i 


Ua 


~'^i,new ^ E 1 ^ ) ([^-^*i] [^old] + Cji) + E ([^old] + Cij] 
j<i \^i / j>i 

^ / ^7.new ^ ([A-Rj j] [?^old] “1“ ^ij') O' 


j>i 


n 


Due to its construction, the [R*j]-operator contains information about the in¬ 
fluence of a particular level on all radiative transitions. Therefore, we are able 
to treat the complete multi-level non-LTE radiative transfer problem includ¬ 
ing active continua and overlapping lines. The [i7(A)]-operator, at the same 
time, gives us information about the strength of the coupling of a radiative 
transition to all levels that are considered. This information may be used to in¬ 
clude or neglect certain couplings dynamically during the iterative solution of 
Eq. 27. Furthermore, we have not yet specihed either a method for the formal 
solution of the radiative transfer equation or a method for the construction 
of the approximate A-operator (and, correspondingly, the [R*j]-operator). We 
proceed by considering rapidly expanding spherically symmetric media and 
use the tri-diagonal ALO given by Hauschildt [1]. However, any method for 
the formal solution of the radiative transfer equation and the construction of 
the ALO may be used, including multi-dimensional and/or time dependent 
methods. 


2.3.2 Solution of the statistical equations 

The system Eq. 27 for [unew] is non-linear with respect to the Uj^new and Ue 
because the coefficients of the [/?))■] and [R*J-operators are quadratic in Uj^new 
and the dependence of the Saha-Boltzmann factors and the collisional rates 
on the electron density, respectively. The system is closed by the abundance 
and charge conservation equations. To simplify the iteration scheme, and to 
take advantage of the fact that not all levels strongly influence all radiative 
transitions, we use a linearized and splitted iteration scheme for the solution 
of Eq. 27. This scheme has the further advantage that many different elements 
in different ionization stages and even molecules can be treated consistently. 
A problem where this is important is the modeling of nova and supernova 
atmospheres, where there are typically very large temperature gradients within 
the line forming region of the atmosphere. 

To linearize Eq. 27, we follow [33] and replace terms of the form nj^new[Rji\ [^^new] 
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in Eq. 27 by nj^oid[Rji\[nnew]- 


^V'.old [Rji] [^new] ^i,old | ^ [^new] + T"! [Rji] [^new] i 

j<i [i<iV'b/ j>i J 

+ ^ ^j,old ( [^new] + ^ nj,new {[^Rji] [^old] + Cji) (28) 

j>i \^j / j<i 

~^i,new I ^ ^ [^old] + Cji) + ^ [^old] + Cij) > 

[j<i\^ij j>i J 

C'^^j,new ( ~7 ) [^old] + Cij) = 0. (29) 

j>i \^j / 

This removes the major part of the non-linearity of Eq. 27 but the modihed 
system is still non-linear with respect to Ue and still has the high dimension¬ 
ality of the original system. However, as has been noted before, not all levels 
are strongly coupled to all other levels. Equation 29 can be solved for each 
element (or groups of elements if they are coupled tightly) separately if the 
electron density is given. Therefore, we split the electron density calculation 
from the rate equation solution so that the Ue can be considered as given dur¬ 
ing the rate equation solution process and changes in the electron density are 
then accounted for in an outer iteration to hnd a consistent solution of the 
rate equations and the electron densities. 

The most important advantage of this method is that it requires the solution 
of large linear systems and low-dimensional non-linear system (for the elec¬ 
tron density). Thus, its solution is more stable and uses much less computer 
resources (time and memory) than the direct solution of the original non¬ 
linear equations. This allows us to treat many more levels with this method 
then with more conventional algorithms. Using a nested iteration scheme like 
the one described here will slow down the convergence of the iterations, but 
this is more than offset for by the possibility of calculating much larger mod¬ 
els with less memory. Since we are able to solve a separate equation for each 
group of elements, we can trivially parallelize the solution by distributing the 
groups among the available processors. Convergence acceleration methods can 
in principle be used, but they frequently lead to convergence instabilities in 
the nested iterations for the solution of the statistical equilibrium equations. 

We have so far assumed that the electron density Ue is given. Although this 
is a good assumption if only trace elements are considered, the electron den¬ 
sity may be sensitive to non-LTE effects. This can be taken into account 
by using either a fixed point iteration scheme for the electron density or, if 
many species or molecules are included in the non-LTE equation of state, 
by a modihcation of the LTE partition functions to include the effects of 
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non-LTE in the ionization equilibrium. The latter method replaces the par¬ 
tition function, Q = 9 i^^p{~Ei/kT), with its non-LTE generalization, 

Qnlte = Z]^j 5 'jexp(—Ej/ZcT), and uses Qnlte in the solution of the ion¬ 
ization/dissociation equilibrium equation. We use this method because of the 
large number of elements with various ionization stages as well as molecules 
and condensation of dust grains included in statistical equilibrium calculations 
(and not all of them in non-LTE). 

Our iteration scheme for the solution of the multi-level non-LTE problem can 
be summarized as follows: (1) for given n* and Ug, solve the radiative transfer 
equation at each wavelength point and update the radiative rates and the 
approximate rate operator, (2) solve the linear system Eq. 29 for each group 
for a given electron density, (3) compute new electron densities (by either 
hxed point iteration or the generalized partition function method), (4) if the 
electron density has not converged to the prescribed accuracy, go back to 
step 2, otherwise go to step 1. The iterations are repeated until a prescribed 
accuracy for the Ug and the n* is reached. It is important to account for 
coherent scattering processes during the solution of the wavelength dependent 
radiative transfer equation, it explicitly removes a global coupling from the 
iterations. 


2-4 Temperature eorrection proeedure 


In the outermost level of the nested iteration scheme we also iterate for the 
temperature structure of the atmosphere using a generalization of the Unsold- 
Lucy temperature correction scheme to spherical geometry and NLTE model 
calculations. This has proven to work very well even in extreme NLTE cases 
such as nova and supernova atmospheres. The temperature correction proce¬ 
dure also requires virtually no memory and CPU time overheads. The Unsold- 
Lucy correction scheme (see Mihalas [34] for a discussion of this and other 
temperature correction schemes), uses the global constraint equation of en¬ 
ergy conservation to hnd corrections to the temperature that will fulhll energy 
conservation better than the previous temperatures. We have found it to be 
more stable than a Newton-Raphson linearization scheme and it allows us to 
separate the temperature corrections from the statistical equations discussed 
above. 

To derive the Unsold-Lucy correction, one uses the fact that the ratios of the 
wavelength averaged absorption and extinction coefficients 


Kp 


( CXJ \ 

J kxBi dx] /B 


(30) 
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«J = 


oo 


(31) 


Xj 



P 

IF 


(32) 

(33) 


(where B, J,F denote the wavelength integrated Planck function, mean in¬ 
tensity and radiation flux, respectively) depend much less on values of the 
independent variables than do the averages themselves. 


Dropping terms of order (v/c), one can then use the angular moments of the 
SSRTE to show that in order to obtain radiation equilibrium B should be 
corrected by an amount 


5B{t) = — {kjJ — kpB + S/{At[)} (34) 

Kp 

T 

- I qXF(H(T)-H„{T))], 

0 


where H = F/An, Hq{t) is the value of the target luminosity at that par¬ 
ticular depth point (variable due to the velocity terms in the SSRTE and 
non-mechanical energy sources, the total observed luminosity iPo(O) is an in¬ 
put parameter). Here, q is the “sphericity factor” given by 


Q = 


3/-1 

rf 


where /(r) = K{t)/J{t) is the “Eddington factor” and K = f dfj, is 
the second angular moment of the mean intensity. S describes all additional 
sources of energy such as mechanical energy supplied by winds or non-thermal 
ionization due to y-ray deposition. 

The hrst term in Eq. 35 corresponds simply to a A iteration term and will thus 
provide too small temperature corrections in the mner parts of the atmosphere 
(but work hne in the outer, optically thin parts). The second term of Eq. 35, 
however, is the dominant term in the inner parts of the atmosphere. It provides 
a very good approximation to the temperature corrections AT deep inside the 
atmosphere. Eollowing [35], we found that it is sometimes better to modify 
this general scheme by, e.g., excluding the contributions of extremely strong 
lines to the opacity averages used in the AT calculations because they tend 
to dominate the average opacity but do not contribute as much to the total 
error in the energy conservation constraint. 
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2.5 Global iteration scheme 


As the first step in our outermost iteration loop (the “model iteration”) we use 
the current best guess of {T, Ui] as function of radius to solve the hydrostatic 
or hydrodynamic equations to calculate an improved run of Pgas with radius. 
Simultaneously, the population numbers are updated to account for changes 
in Pgas- The next major step is the computation of the radiation held for each 
wavelength point (the “wavelength loop”), which has the prerequisite of a 
spectral line selection procedure for LTE background lines. Immediately after 
the radiation held at any given wavelength is known, the radiative rates and 
the rate operators are updated so that their calculation is hnished after the last 
wavelength point. In the next steps, the population numbers are updated by 
solving the rate equations for each NLTE species and new electron densities are 
computed, this gives improved estimates for {nj}. The last part of the model 
iteration is the temperature correction scheme outlined above (using opacity 
averages etc. that were computed in the wavelength loop) which delivers an 
improved temperature structure. If the errors in the constraint equations are 
larger than a prescribed accuracy, the improved {T, ni} are used in another 
model iteration. Using this scheme, about 10-20 model iterations are typi¬ 
cally required to reach convergence to better than about 1% relative errors, 
depending on the quality of the initial guess of the independent variables and 
the complexity of the model. 


3 Parallelization 


Solving the above set of coupled non-linear equations for large numbers of 
NLTE species requires large amounts of memory to store the rates for each 
level in all the model atoms at each radial grid point, and large amounts of 
CPU time because many wavelength points are required in order to resolve 
the line prohles in the co-moving frame. In order to minimize both CPU and 
memory requirements we have parallelized the separate Fortran 90 modules 
which make up the PHOENIX code. Our experience indicates that only the 
simultaneous use of data and task parallelism can deliver reasonable parallel 
speedups [8]. This involves: 

(1) The radiative transfer calculation itself, where we divide up the char¬ 
acteristic rays among nodes and use a “reduce” operation to collect and 
send the Jy to all the radiative transfer and NLTE rate computation tasks 
(data parallelism); 

(2) the line opacity which requires the calculation of up to 50,000 Voigt 
prohles per wavelength point at each radial grid point, here we split the 
work amongst the processors both by radial grid points and by dividing 
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up the individual lines to be calculated among the processors (combined 
data and task parallelism); and 

(3) the NLTE calculations. The NLTE calculations involve three separate 
parts: the calculation of the NLTE opacities, the calculation of the rates 
at each wavelength point, and the solution of the NLTE rate and statisti¬ 
cal equilibrium equations. To prevent communication overhead, each task 
computing the NLTE rates is forced to be on the same node with the cor¬ 
responding task computing NLTE opacities and emissivities, (combined 
data and task parallelism). The solution of the rate equations parallelizes 
trivially with the use of a diagonal approximate rate operator. 

In the latest version of our code, PHOENIX 9.1, we have incorporated the 
additional strategy of distributing each NLTE species (the total number of 
ionization stages of a particular element treated in NLTE) on separate nodes. 
Since different species have different numbers of levels treated in NLTE (e.g. 
Fe II [singly ionized iron] has 617 NLTE levels, whereas H I has 30 levels), care 
is taken to balance the number of levels and NLTE transitions treated on each 
node to avoid unnecessary synchronization and communication problems. We 
have also parallelized the selection of background atomic and molecular LTE 
lines (a signihcant amount of work considering that our combined line lists 
currently include about 400 million lines and we expect line lists with about 1 
billion lines in the near future). Although the line selection seems at first glance 
to be an inherently serial process, since a hie sorted in wavelength with selected 
lines must be written to disk, we are able to obtain reasonable speedups, by 
employing a client-server model with a server line-selection task which receives 
the selected lines and writes them to disk and client nodes which read pieces 
(blocks) of the line list hies and carry out the actual selection processes on 
each block of lines. 

In addition to the combined data and task parallelism discussed above, PHOENIX 
also uses simultaneous explicit task parallelism by allocating diherent tasks 
(e.g., atomic line opacity, molecular line opacity, radiative transfer) to diherent 
nodes. This can result in further speed-up and better scalability but requires 
a careful analysis of the workload between diherent tasks (the workload is also 
a function of wavelength, e.g., diherent number of lines that overlap at each 
wavelength point) to obtain optimal load balancing. 


3.1 Wavelength Parallelization 


The parallelization of the computational workload outlined in the previous 
paragraphs requires synchronization between the radiative transfer tasks and 
the NLTE tasks, since the radiation held and the A* operator must be passed 
between them. In addition, our standard model calculations use 50 radial grid 
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points and as the number of nodes increases, so too does the communication 
and loop overhead, therefore, pure data parallelism does not deliver good seal- 
ability. We found good speedup up to about 5 nodes for a typical calculation, 
with the speedup close to the theoretical maximum. However, for 5 nodes the 
communication and loop overheads begin to become signiheant and it is not 
economical to use more than 10 nodes (depending on the machine and the 
model calculation, it might be necessary to use more nodes to £t the data in 
the memory available on a single node). 

Since the number of wavelength points in a calculation is very large and the 
CPU time scales linearly with the number of wavelength points, paralleliza¬ 
tion with respect to the wavelength points can lead to large speedups and 
to the ability to use very large numbers of nodes available on massively par¬ 
allel supercomputers. This poses no difficulties for static conhguration, but 
the coupling of the wavelengths points in expanding atmospheres makes the 
wavelength parallelization much more complex. 

We have developed a wavelength parallelization based on a toroidal topology 
that uses the concept of wavelength “clusters” to distribute a set of wavelength 
points (for the solution of the wavelength dependent radiative transfer) onto 
a different set of nodes, see Fig. 3 [9]. In order to achieve optimal load balance 
and, more importantly, in order to minimize the memory requirements, each 
cluster (a column of nodes indicated in Fig. 3) works on a single wavelength 
point at any given time. Each cluster can consist of a number of “worker” 
nodes where the worker node group uses parallelization methods discussed 
above (see also Ref. [8]). In order to avoid communication overhead, we use 
symmefnc wavelength clusters: each “row” of worker nodes in Fig. 3 performs 
identical tasks but on a different set of wavelength points for each cluster. 
We thus arrange the total number of nodes iV in a rectangular matrix with n 
columns and m rows, where n is the number of clusters and m is the number 
of workers for each cluster, such that N = n* m. Another way of visualizing 
this parallelization technique is to consider each wavelength cluster as a single 
entity (although not a single node or CPU) that performs a variety of differ¬ 
ent tasks at each wavelength point. The entity (cluster) itself is then further 
subdivided into individual nodes or CPUs each of which perform a given set 
of tasks at a particular wavelength point. This topology can be implemented 
very efficiently in the context of the MPI library, see [8] for details. 

For a static model atmosphere, all wavelengths and thus wavelength clusters 
are completely independent and execute in parallel with no immediate com¬ 
munication or synchronization along the rows of Fig. 3. Communication is 
only necessary after the calculation is complete for all wavelengths points on 
all nodes to collect, e.g., the rates and rate operators. Therefore, the speedup 
is close (80%) to the theoretical maximum, limited only by to combined 10 
bandwidth of the machine used. 
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In order to parallelize the spectrum calculations for a model atmosphere with 
a global velocity field, such as the expanding atmospheres of novae, supernovae 
or stellar winds, we need to take the mathematical character of the RTE into 
account. For monotonic velocity helds, the RTE is an initial value problem in 
wavelength (with the initial condition at the smallest wavelength for expand¬ 
ing atmospheres and at the largest wavelength for contracting atmospheres). 
This initial value problem must be discretized fully implicitly to ensure sta¬ 
bility. In the simplest case of a first order discretization, the solution of the 
RTE for wavelength point i depends only on the results of the point i — 1. In 
order to parallelize the spectrum calculations, the wavelength cluster n* com¬ 
puting the solution for wavelength point i must know the specihc intensities 
from the cluster rii-i computing the solution for point i — 1. This suggests a 
“pipeline” solution to the wavelength parallelization, transforming the “ma¬ 
trix” arrangement of nodes into a “torus” arrangement where data are sent 
along the torus’ circumference. Note that only the solution of the RTE is af¬ 
fected by this, the calculation of the opacities and rates remains independent 
between different wavelength clusters. In this case, the wavelength paralleliza¬ 
tion works as follows: Each cluster can independently compute the opacities 
and start the RT calculations (e.g., the A* calculations, hereafter called the 
pre-processing phase), it then waits until it receives the specihc intensities for 
the previous wavelength point, then it hnishes the solution of the RTE and 
immediately sends the results to the wavelength cluster calculating the next 
wavelength point (to minimize waiting time, this is done with non-blocking 
send/receives), then proceeds to calculate the rates etc. (hereafter called the 
post-processing phase and the new opacities for its next wavelength point and 
so on. 

The important point in this scheme is that each wavelength cluster can execute 
the post-processing phase of its current wavelength point and pre-processing 
phase of its next wavelength point independently and in parallel with all other 
clusters. This means that the majority of the total computational work can 
be done in parallel, leading to a substantial reduction in wall-clock time per 
model. Ideal load balancing can be obtained by dynamically allocating wave¬ 
length points to wavelength clusters. This requires only primitive logic with 
no measurable overhead, however it requires also communication and an ar¬ 
bitration/synchronization process to avoid deadlocks. Typically, the number 
of clusters n (4-64) is much smaller than the number of wavelength points, 
TTwi ~ 300, 000, so that at any given time the work required for each wave¬ 
length point is roughly the same for each cluster (the work changes as the 
number of overlapping lines changes, for example). Therefore, a simple round 
robin allocation of wavelength points to clusters (cluster i calculates wave¬ 
length points i, n + i, 2n + i and so on) can be used which will result in nearly 
optimal performance if the condition n n^\ is fulhlled. However, once the 
pipeline is full, adding further wavelength clusters cannot decrease the time 
required for the calculations, setting a limit for the efficient “radius” of the 
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torus topology. However, this limit can be increased somewhat by increasing 
the number of worker nodes per wavelength cluster. 


3.1.1 Scaling Results 

For a simple supernova test calculation, we examine both the scaling and 
performance tradeoff of spatial versus wavelength parallelization. Figure 4 
presents the results of our timing tests for one iteration of a Type Ic su¬ 
pernova model atmosphere, with a model temperature Tmodei = 12,000 K 
(the observed luminosity is given by L = 47ri?^Tjnodei"^), characteristic velocity 
Vo = 10000 kms“^, 4666 NLTE levels, 163812 NLTE lines, 211680 LTE lines 
(for simplicity, all line prohle were assumed to be Gaussian), non-homogeneous 
abundances, and 260630 wavelength points. This is a typical test for produc¬ 
tion calculations and we have designed this test to have the highest potential 
for synchronization, I/O waiting, and swapping to reduce performance to sim¬ 
ulate a worst case scenario for the parallel performance. It is however, char¬ 
acteristic of the level of detail needed to accurately model supernovae. This 
calculation has also been designed to barely £t into the memory of a single 
node. The behavior of the speedup is very similar to the results obtained for 
test case using a model of a nova explosion [9]. The “saturation point” at which 
the wavelength pipeline fills and no further speedup can be obtained if more 
wavelength clusters are used for the machines used here, occurs at about 5 to 
8 clusters. More clusters will not lead to larger speedups, as expected. Larger 
speedups can be obtained by using more worker nodes per cluster, which also 
drastically reduces the amount of memory required on each node. 


4 Discussion and Conclusions 


We have presented our approach to the numerical solution to the generalized 
stellar atmosphere problem in the presence of rapidly expanding flows. We 
have shown how the use of accelerated A operators may result in the for¬ 
mulation of the problem in such a way that extremely detailed model atoms 
may be handled in NLTE and the problem can be parallelized in a way that 
significantly reduces the per processor memory and CPU requirements with 
modest communication overhead. Parallelization also allows much more com¬ 
plex models to be calculations by giving us access to the large memory sizes 
that are available on modern parallel supercomputers. Currently, our largest 
model calculations involve 6000 atomic NLTE level with 65000 primary NLTE 
lines that are modeled individually, 2-10 million weak atomic secondary NLTE 
and LTE background lines and, for models of cool stellar winds, 150 million 
molecular lines. Simulations of this size and level of detail were simply not 
possible before the development of new radiative transfer algorithms and the 
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availability of parallel supercomputers. We believe that the next step — the 
computation of moving flows in three spatial dimensions, is becoming tractable 
on modern parallel supercomputers. There continues to be an urgent need for 
improvements in the fundamental atomic data which serves as input to these 
calculations. 


Acknowledgement 


We thank our many collaborators who have contributed to the development 
of PHOENIX, in particular we would like to thank France Allard, David Branch, 
Steve Shore, Sumner Starrheld, Jason Aufdenberg, and Andreas Schweitzer. 
This work was snpported in part by NASA ATP grant NAG 5-3018 and LTSA 
grant NAG 5-3619 and NSF grant AST-9720804 to the University of Geor¬ 
gia, and by NSF grant AST-9417242, NASA grant NAG5-3505 and an IBM 
SUR grant to the University of Oklahoma. Some of the calculations presented 
in this paper were performed on the IBM SP2 and SGI Origin 2000 of the 
UGA UGNS, at the San Diego Supercompnter Genter (SDSG), the Gornell 
Theory Genter (GTG), and at the National Genter for Super computing Ap¬ 
plications (NGSA), with snpport from the National Science Foundation, and 
at the NERSG with snpport from the DoE. We thank all these institutions 
for a generous allocation of computer time. 


References 

[1] Hauschildt, P. H. JQSRT 47, (1992) 433. 

[2] Hauschildt, P. H. JQSRT 50, (1993) 301. 

[3] Hauschildt, P. H. and Baron, E. JQSRT 54, (1995) 987. 

[4] Hauschildt, P. H., Starrheld, S., Shore, S. N., Allard, F., and Baron, E. ApJ 
447, (1995) 829. 

[5] Allard, F. and Hauschildt, P. H. ApJ 445, (1995) 433. 

[6] Hauschildt, P. H., Baron, E., Starrheld, S., and Allard, F. ApJ 462, (1996) 
386. 

[7] Baron, E., Hauschildt, P. H., Nugent, P., and Branch, D. MNRAS 283, (1996) 
297. 

[8] Hauschildt, P. H., Baron, E., and Allard, F. ApJ 483, (1997) 390. 

[9] Baron, E. and Hauschildt, P. H. Parallel implementation of the PHOENIX 
generalized stellar atmosphere program. H: Wavelength parallelization. ApJ 
495, (1998) 370. 


25 



[10] Message Passing Interface Forum. MPI: A Message-Passing Interfaee Standard, 
Version 1.1 (Univ. of Tennessee, Knoxville, TN, 1995). 

[11] Mihalas, D. Stellar Atmospheres (W. H. Freeman, New York, 1978), second 
edn. 

[12] Magnan, C. JQSRT 10 , (1970) 1. 

[13] Caroff, L. J., Noerdlinger, P. D., and Scargle, J. D. ApJ 176 , (1972) 439. 

[14] Auer, L. H. and Blerkom, D. V. ApJ 178, (1972) 175. 

[15] Castor, J. I. MNRAS 149, (1970) 111. 

[16] Mihalas, D., Kunasz, P., and Hummer, D. ApJ 202, (1975) 465. 

[17] Hauschildt, P. H. and Wehrse, R. JQSRT 46, (1991) 81. 

[18] Hauschildt, P. H., Best, M., and Wehrse, R. A&A 247 , (1991) L21. 

[19] Cannon, C. J. JQSRT 13, (1973) 627. 

[20] Scharmer, G. B. In W. Kalkofen, ed.. Methods in Radiative Transfer 
(Cambridge Univ. Press, Cambridge, 1984) p. 173, p. 173. 

[21] Olson, G. L., Auer, L. H., and Buchler, J. R. JQSRT 38, (1987) 431. 

[22] Hamann, W.-R. In Kalkofen [36] p. 35, p. 35. 

[23] Werner, K. In Kalkofen [36] p. 67, p. 67. 

[24] Olson, G. L. and Kunasz, P. B. JQSRT 38 , (1987) 325. 

[25] Hauschildt, P. H., Storzer, H., and Baron, E. JQSRT 51 , (1994) 875. 

[26] Mihalas, D. and Mihalas, B. W. Foundations of Radiation Hydrodynamies 
(Oxford University, Oxford, 1984). 

[27] Mihalas, D. ApJ 237 , (1980) 574. 

[28] Golub, G. H. and Loan, C. V. Matrix eomputations (The Johns Hopkins 
University Press, 1989). 

[29] Turek, S. preprint, (1993). 

[30] Ng, K. C. J. Chem. Phys. 61 , (1974) 2680. 

[31] Vinsome, P. W. In Proc. of the 4th Symp on Reservoir Simulation. Soc. of 
Petroleum Engineers, (1976) p. 149, p. 149. 

[32] Auer, L. In L. Grivellari, I. Hubeny, and D. G. Hummer, eds.. Stellar 
Atmospheres: Beyond Classieal Models (NATO ASI Series, Kluwer, Dordrecht, 
1991) p. 9, p. 9. 

[33] Rybicki, G. B. and Hummer, D. G. A&A 245, (1991) 171. 

[34] Mihalas, D. Stellar Atmospheres (W. H. Ereeman, New York, 1970), first edn. 


26 



[35] Unsold, A. Physik der Sternatmosphdren (Springer Verlag, Heidelberg, 1968), 
second edn. 

[36] Kalkofen, W., ed. Cambridge, (1987). Cambridge Univ. Press. 


27 



Fig. 1. Relation between (some of) the physical and mathematical blocks that de¬ 
scribe the physics of a stellar atmosphere. In order to calculate a model atmosphere, 
a set of value of the physical variables, e.g., temperatures, densities, population 
densities and the radiation field, must be found that satisfies all constraints simul¬ 
taneously. 


Fig. 2. Relations between the main types of variables represented by blocks are 
indicated. The labels name the equations that relate the block to each other. 


Fig. 3. The basic “torus” design of the wavelength-parallelized version of PHOENIX: 
groups of processors are divided up into wavelength clusters which will work on 
individual wavelength points, each wavelength cluster is further divided into a set 
of worker nodes, where each worker node is assigned a set of specific tasks, e.g., it 
will work on the LTE background line opacity for a set of radial points. Our design 
requires that each worker node on all wavelength clusters work on exactly the same 
set of tasks, although additional inherently serial operations can be assigned to one 
particular master worker, or master wavelength cluster. This reduces communication 
between clusters to its absolute minimum and allows the maximum speedup. 


Fig. 4. Scalability of the Supernova model atmosphere test run as function of the 
number of nodes (processing elements or nodes) used. The y-axis gives the speedup 
obtained relative to the serial run. The different symbols show the results for dif¬ 
ferent numbers of worker tasks for each wavelength cluster. 



